This document was most recently knit on 2021-01-31.

Introduction

In the last tutorial, we covered many of the fundamental ideas underlying the grammar of graphics: 1) aesthetics and how to map your data to aesthetics, 2) geometries and how to layer them, and 3) how geometries inherit aesthetics from the initial ggplot call. When you develop a good grasp on these fundamental ideas, you have enough knowledge to figure out how to make nearly any visualization you want (especially now that you know what keywords to search for). In today’s tutorial, we’ll learn how to give your plots a little more polish. The goal is to make publication-ready plots, which (in theory) won’t require further editing in external software like Illustrator or Inkscape.

As you work through this tutorial’s material, you may find yourself having to review those fundamental ideas from last time. That’s normal, and perfectly fine. You shouldn’t expect to become an expert after working through a single tutorial. I’ll try to provide some hints and reminders in today’s tutorial, which will hopefully let you review and scaffold from your existing knowledge.

Let’s go ahead and load some of the same data we were working with last time.

library(tidyverse)
library(lubridate)
library(janitor)
library(here)

covid <- here("Data", "time_series_covid19_confirmed_US.csv") %>%
  read_csv() %>%
  clean_names() %>%
  select(-c(uid:fips, country_region:combined_key)) %>%
  rename(county = admin2, state = province_state) %>%
  pivot_longer(cols = -c(county, state), names_to = "date", values_to = "cases") %>%
  mutate(date = str_remove(date, "x"),
         date = mdy(date)) %>%
  group_by(state, date) %>%
  summarise(cases = sum(cases)) %>%
  ungroup()

elections <- here("Data", "countypres_2000-2016.csv") %>%
  read_csv() %>%
  filter(year == 2016) %>%
  filter(party %in% c("democrat", "republican")) %>%
  group_by(state, candidate) %>%
  summarise(candidatevotes = sum(candidatevotes, na.rm=T)) %>%
  group_by(state) %>%
  mutate(lean_democrat = candidatevotes / first(candidatevotes)) %>%
  filter(candidate == "Hillary Clinton") %>%
  ungroup() %>%
  select(state, lean_democrat)

regions <- here("Data", "state_region_division.csv") %>%
  read_csv()

Axis labels

You may have noticed that when you don’t supply ggplot with axis labels, it defaults to using your data’s variable names:

covid %>%
  ggplot(mapping = aes(x=date, y=cases))

In this case, it doesn’t look terrible, though we probably would prefer the axis labels to be capitalized. However, we’ll often be working with variable names that look like count_normalized_by_pop. That’s considerably less pretty, and I doubt you’d want to publish a figure with that exact axis label. Here’s the simplest way to change the axis labels. As a bonus, I’ve also shown how to add a title and subtitle to your plot.

covid %>%
  ggplot(mapping = aes(x=date, y=cases)) +
  xlab("Time") +
  ylab("Raw case numbers") +
  ggtitle(label = "covid-19 cases over time",
          subtitle = "(not normalized by state population)")

Facets

Last time, we explored how to color-code our plots by mapping our data to the aesthetics color and fill. That worked okay when we were only comparing a few states to each other, but it was near-impossible to tell colors apart when we were plotting all 50 states (plus DC, territories, and protectorates). This is an example of overplotting, i.e. packing so much information into a single plot that it becomes hard to read. Which defeats the whole purpose of graphing out data! To illustrate, let’s create a visualization that is clearly overplotted. Try picking out Tennessee from Texas…

covid %>%
  ggplot(mapping = aes(x=date, y=cases, color=state)) +
  geom_line() +
  theme(legend.position = "bottom")

Since states are discrete categories, it could potentially be useful to separate each state’s data out into its own panel. That can be accomplished using a facet. In this example, we’ll use the function facet_wrap. You can see that we specify one argument, ~state.

The squiggly symbol ~ is called a tilde, and can be found (on a US English keyboard at least) above the tab key and to the left of the number 1. When we start learning about statistics, we’ll see the tilde very frequently. For example, we might be interested in using a penguin’s flipper length to predict its body mass. We would write that formula as body_mass ~ flipper_length. Colloquially, we might say that we’re interested in predicting the penguin’s body mass by its flipper length. The same interpretation holds for our use of facet_wrap: we are faceting the data by state.

covid %>%
  ggplot(mapping = aes(x=date, y=cases)) +
  geom_line() +
  facet_wrap(~state) +
  theme(legend.position = "bottom")

By looking at the documentation (?facet_wrap), you could see how to specify the exact number of columns/rows ggplot generates before it wraps the facets (hence the name of the function).

What if we have two discrete variables we want to facet by? For example, we could be interested in examining covid counts by geographical region and political ideology (for illustration, we’ll convert the continuous variable lean_democrat into a discrete variable). Well, we can take that verbal statement and translate it into a formula, and ggplot will happily accept it.

As you’re typing out this code, make sure you can answer the following questions: 1) Why have we added the group aesthetic in the initial ggplot call, and what happens if we take it out? 2) Recall that the variable lean_democrat is a ratio. Why have we defined the new variable ideology in this way? 3) In the previous plot, we faceted by state. So why haven’t we specified the formula to be ~state + ideology in this plot? 4) In our call to facet_wrap, why have we specified ncol = 2? Why not a different number? 5) What would happen if we made the formula ~ideology + region instead, and might this change motivate us to specify a different ncol?

covid %>%
  inner_join(elections) %>%
  mutate(ideology = if_else(
    lean_democrat > 1,
    "Democrat-leaning",
    "Republican-leaning")) %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_wrap(~region + ideology, ncol = 2)

It would be totally fair if you looked at this plot and declared it ugly. When you ask facet_wrap to give you multiple facets, it stacks the labels on top of each other. This takes up a lot of space, and doesn’t necessarily give you the prettiest output. As an alternative, you could use facet_grid:

covid %>%
  inner_join(elections) %>%
  mutate(ideology = if_else(
    lean_democrat > 1,
    "Democrat-leaning",
    "Republican-leaning")) %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(rows = vars(region), cols = vars(ideology))

In other people’s older code, you might see the syntax row_variable ~ column_variable instead. This style is being phased out for three reasons. First, it was hard for people to remember whether LHS (left-hand side) corresponded to rows or columns. Second, if you were plotting only a single facet, your code would look like chickenscratch: . ~ column_variable. Third, when we were supplying a formula to facet_wrap, we understood that ~state implicitly meant covid counts ~ state (or in plain English, covid counts by state). So in the past, facet_grid forced us to specify something that looked like a formula, but was totally inconsistent with our understanding of how formulas work.

covid %>%
  inner_join(elections) %>%
  mutate(ideology = if_else(
    lean_democrat > 1,
    "Democrat-leaning",
    "Republican-leaning")) %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(region ~ ideology)

Scales and legends

We haven’t really emphasized this point so far, but every aesthetic has a scale associated with it. Let’s consider the simplest case of the x and y-axis aesthetics. We’re working with a continuous x-axis (time) and a continuous y-axis (case counts). There are two key assumptions we’ve made when plotting these: 1) that we do actually want those scales to be continuous, and 2) that we want the axes to be plotted on a linear scale. In the code below, we make this unstated assumption explicit by asking ggplot to use a continuous y-axis scale. While we’re at it, we’ll also modify the axis title. Recall that we could also do this using ylab.

covid %>%
  ggplot(aes(x=date, y=cases, group=state)) +
  geom_line() +
  scale_y_continuous(name = "Raw covid-19 case counts")

But if we wanted to, we could specify that an aesthetic be represented by a completely different kind of scale. For example, when working with continuous data that has a very large range, it can make sense to plot a logarithmic scale instead of a linear one.

covid %>%
  ggplot(aes(x=date, y=cases, group=state)) +
  geom_line() +
  scale_y_log10(name = "log-transformed covid-19 case counts")
## Warning: Transformation introduced infinite values in continuous y-axis

Depending on our visualization needs, we could use a variety of scales. Another common scale transformation is to a square-root scale. If (for some reason) we wanted to sort continuous values into bins, we could do that too. Though there’s virtually no reason why we would want to visualize these particular data in this way, we could even flip the y-axis scale so that the data are plotted upside down.

covid %>%
  ggplot(aes(x=date, y=cases, group=state)) +
  geom_line() +
  scale_y_sqrt(name = "Square-root-transformed covid-19 case counts")

covid %>%
  ggplot(aes(x=date, y=cases, group=state)) +
  geom_line() +
  scale_y_binned(name = "Binned covid-19 case counts")

covid %>%
  ggplot(aes(x=date, y=cases, group=state)) +
  geom_line() +
  scale_y_reverse(name = "Inexplicably-plotted covid-19 case counts")

Discrete color scales

This basic logic extends very naturally to other aesthetics. For example, let’s examine color-coding. We could color-code the lines according to geographic region. By default, ggplot knows that region is a discrete (i.e., categorical) variable, so it uses a discrete color scale. We can make this assumption explicit by writing the following. Note again that we can change the name of the legend from the variable name region by specifying a name of our choosing.

covid %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  geom_line() +
  scale_color_discrete(name = "Census-designated region:")

If you look at this plot and find it ugly, you are in good company. I’ve convinced that ggplot’s default color palette is deliberately designed to revolt people into specifying their own (better) color scales. More importantly, we should also note that the default palette is not a color palette that is accessible to people with certain forms of color-blindness. Less importantly, if we printed off the plot on a black-and-white printer, we wouldn’t be able to tell the colors apart anymore.

So perhaps we’d like to use an alternative color scale that’s prettier and more accessible. One built-in option is viridis, which is engineered to be perceptually uniform, accessible to viewers with the most common form of color blindness, and convertible to black-and-white. The d at the end of scale_color_viridis_d refers to the fact that our color aesthetic is being mapped to discrete data. There are a couple of viridis palettes: viridis, magma, inferno, plasma, and cividis. You can see that I prefer plasma, and that I even specify a custom subset of the full scale to use (to see why, try removing those arguments). You will develop your own preferences with experience.

covid %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  geom_line() +
  scale_color_viridis_d(name = "Census-designated region:",
                        begin = 0.1, end = 0.8, option = "plasma")

Another built-in option is color_brewer. The best way to understand the brewer is to play with it online. Once you’ve played around with it, the arguments are pretty straightforward to understand:

covid %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  geom_line() +
  scale_color_brewer(name = "Census-designated region:",
                     palette = "GnBu")

Even though the Brewer is convenient, I don’t find myself using it very much. In the plot above, you can hardly see the lines for the Midwest. In practice, I go to the Color Brewer website, pick out colors that I actually like, then feed those values to scale_color_manual. You might say that the end result below looks bad. Truth be told, I’m actually not the biggest fan of the colors I picked out below. For starters, it’s not a colorblind-friendly palette. But at the least, I know it’s bad because I have poor artistic taste, not because the function automatically picked out “bad” colors on my behalf.

covid %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  geom_line() +
  scale_color_manual(name = "Census-designated region:",
                     values = c("#3182bd", "#de2d26", "#31a354", "#756bb1"))

Continuous color scales

So far, we’ve only worked with discrete color scales. But, the exact same kind of logic works with continuous color scales. For example, let’s color-code the lines according to how strongly a state leaned Democrat in the 2016 Presidential election. Note that DC is removed because it overwhelmingly preferred Clinton, such that including it made the rest of the plot unreadable.

covid %>%
  inner_join(elections) %>%
  filter(state != "District of Columbia") %>%
  ggplot(mapping = aes(x=date, y=cases, color=lean_democrat, group=state)) +
  geom_line() +
  scale_color_continuous(name = "Democrat-voting ratio:")

It’s often very hard to read continuous color scales. Remember how we’d previously used a y-axis that used a binned scale? That might not be very useful for case counts, but it is often useful for making our color scale more readable.

covid %>%
  inner_join(elections) %>%
  filter(state != "District of Columbia") %>%
  ggplot(mapping = aes(x=date, y=cases, color=lean_democrat, group=state)) +
  geom_line() +
  scale_color_binned(name = "Democrat-voting ratio:")

You can modify the palette of continuous color scales in (more-or-less) exactly the same way we modified discrete color scales. On your own, try using viridis, brewer, and manual color scales.

Other aesthetics

I won’t spend much time on this section. I only want to note that there are similar scales for other aesthetics, and reiterate that every aesthetic is represented by a scale even when it’s not explicitly specified. ggplot is usually pretty good about picking default scales, but just remember that you can take complete control at any time, for any aesthetic.

Coordinates

Conceptual background

In your history or math classes, you might remember having learned about the ancient Greek mathematician Euclid. His major accomplishment was deriving a system for describing geometric objects in terms of a small number of fundamental assumptions (known as axioms), which could be logically combined into theorems (statements that are true, as long as the axioms are true and have been combined without error). You may have heard the term “Euclidean geometry” used to describe this system of thinking.

When we create plots, we are interested in mapping data to geometries, and we have already spent considerable time considering this idea. But we’ve taken for granted that these geometries exist within a coordinate system. Let’s imagine that we draw a square on a balloon that hasn’t been inflated yet. We can describe the properties of that square quite well using Euclidean geometry. Then, we blow air into the balloon until it’s fully inflated. When we look at the drawing we’ve made, will it still look like a square? Not really, because we’ve warped the underlying space. And if we tried to use Euclidean geometry to analyze the warped square, we wouldn’t get very far.

For a moment, let’s return to the scales that modify the x- and y-axes. These are special because they modify the coordinate system of the entire plot. For example, if we use scale_y_sqrt to plot the data on a square-root-transformed y-axis, we’re no longer operating in a linear coordinate space. Instead, we’re now working in a nonlinear square-root coordinate space.

Most of the time, when we need to modify the x/y-axis scales, it’s easiest to do it through scale_x_whatever or scale_y_whatever. However, there are going to be times when you’ll want to modify the coordinate system directly.

Zooming

In our plots so far, we’ve been displaying all available case data, starting from January 22. But we know in hindsight that the first major wave didn’t occur until mid-March. We might therefore want to “zoom” into our data, plotting the case numbers starting in March instead of January. To do this, we can use the default coordinate system, coord_cartesian, and additionally feed that function an extra argument to limit what range of data is plotted. You may now be able to guess what the argument xlim does, and if not, consult the documentation (?coord_cartesian). Now, try seeing whether you can also zoom in on the y-axis.

covid %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  geom_line() +
  scale_color_viridis_d(name = "Census-designated region:",
                        begin = 0.1, end = 0.8, option = "plasma") +
  coord_cartesian(xlim = c(mdy("03/01/2020"), mdy("09/24/2020")))

Aspect ratio

In your Plots tab in RStudio, you’ve probably noticed that there’s a button labeled Zoom. If you click on this, it opens a new window with your plot. As you resize this window, the shape of your plot changes with it. So if you make your plot window into a square, your plot becomes more square-like. But, you might be interested in maintaining a constant aspect ratio, so that your plot will always have the same shape even if you change its size.

To illustrate, let’s compare the following two plots. Try opening up a new plot window and resizing the plot to ensure that it maintains a constant shape, even as you change its size.

covid %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  geom_line() +
  scale_color_viridis_d(name = "Census-designated region:",
                        begin = 0.1, end = 0.8, option = "plasma") +
  coord_fixed(ratio = 1/1000)

covid %>%
  inner_join(regions) %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  geom_line() +
  scale_color_viridis_d(name = "Census-designated region:",
                        begin = 0.1, end = 0.8, option = "plasma") +
  coord_fixed(ratio = 1/10000)

Bonus: using a ratio of 1/10000, zoom in so that we begin plotting data from March onwards.

Polar coordinates

By default, ggplot will return a stacked bar when you ask it for a barplot, which makes it easy to compare the proportions of whatever’s being plotted. For example, we could look at the (raw!) counts in three states, and immediately get a sense for the relative share of cases in those states. Of course, as we’ve previously discussed, there are issues with plotting raw case numbers (e.g., because California obviously has a bigger population than Rhode Island), but this is just for the sake of illustration.

covid %>%
  filter(date == mdy("09/24/2020")) %>%
  filter(state %in% c("Tennessee", "California", "Rhode Island")) %>%
  ggplot(mapping = aes(x=date, y=cases, fill=state)) +
  geom_bar(stat = "identity")

Another popular way to visualize proportions is using a pie chart. Some people will argue that you should never use them, because humans are notoriously bad at judging polar angles. In contrast, people are notoriously good at judging line lengths. So, there’s something to be said about using barplots (stacked or otherwise) to visualize proportion data when you can.

If you ever do find yourself needing to make a pie chart, you can easily use the polar coordinate system to transform the Cartesian system into the polar system:

covid %>%
  filter(date == mdy("09/24/2020")) %>%
  filter(state %in% c("Tennessee", "California", "Rhode Island")) %>%
  ggplot(mapping = aes(x=date, y=cases, fill=state)) +
  geom_bar(stat = "identity") +
  coord_polar(theta = "y")

Themes

Pre-made themes

By default, ggplot uses a theme that has a grey background and white grid lines. Some people really like this. Personally, I don’t. I like plots that look very clean, and I don’t think the grey background (for example) is particularly clean-looking. Luckily for people like me, ggplot comes with a bunch of pre-made themes that suit a variety of preferences.

Let’s store a dataframe in our computer’s memory that has the regions dataset joined to the covid dataset. Typically, we do this at runtime via covid %>% inner_join(regions), but we’ll be using this dataframe a lot. Let’s save ourselves some typing and lessen our computers’ computing loads.

covid_regions <- covid %>%
  inner_join(regions)
## Joining, by = "state"

As usual, here’s the default theme (though we don’t typically call theme_grey explicitly).

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_grey()

Now, let’s just try running through other themes. Notice that each theme changes the background color, the color/existence of grid lines, and whether borders are drawn around facets.

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_bw()

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line(color = "white") +
  facet_grid(cols = vars(region)) +
  theme_dark()

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_light()

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_classic()

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_linedraw()

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_minimal()

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_void()

Manual adjustments

Personally, I really like theme_bw and I use it in nearly all of my own plots. But, there are often times when I want/need to make manual adjustments to that pre-made template. For example, I often want to get rid of the grid lines. Sometimes, I want legends to be drawn below the plot, not beside it. I don’t really like that the default location of the plot title is left-justified, not centered. As with all other parts of ggplot, we can easily take full manual control over any thematic element. The full list of possible thematic adjustments is outside the scope of this tutorial, but can be viewed in the ggplot reference manual.

For the sake of example, we’ll try out the three things I mentioned. For reference, here’s the plot we get if we use all of the default thematic settings:

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  ggtitle("covid-19 cases over time") +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_bw()

And now, adding our manual adjustments:

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, color=region, group=state)) +
  ggtitle("covid-19 cases over time") +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5)
  )

Miscellany

Position adjustments

There are times when you might want to adjust where data are plotted. For example, the default barplot stacks groups on top of each other. You might instead want groups to be plotted beside each other. Once again, we might be interested in comparing case counts in California, Rhode Island, and Tennessee. Stacked barplots are decently good for visualizing proportions, but it’s hard to beat side-by-side comparisons. We can do this by specifying an additional argument, position, which specifies what position the geom should be drawn in. That argument takes a function that specifies how the position adjustment should be made. Here, we’ll use the function position_dodge, so named because the bars are trying to “dodge” each other.

covid %>%
  filter(date == mdy("09/24/2020")) %>%
  filter(state %in% c("Tennessee", "California", "Rhode Island")) %>%
  ggplot(mapping = aes(x=date, y=cases, fill=state)) +
  geom_bar(stat = "identity", position = position_dodge())

Position adjustments can also help when plotting overlapping datapoints. To illustrate, we’re going to make up a dataset where all of the datapoints are overlapping:

overlap <- tibble(
  x = rep(1, times = 20),
  y = rep(1, times = 20)
  )

When plotted, the 20 datapoints are visually compressed in a single point. This is bad because the viewer don’t know whether that point represents a single observation, or a thousand overlapping observations.

overlap %>%
  ggplot(aes(x, y)) +
  geom_point() +
  coord_cartesian(xlim = c(0.95, 1.05))

But if we make a position adjustment, such that we “jitter” the data, it becomes evident that there are many datapoints there. Note how we specify two arguments for position_jitter. What happens if we omit one or both?

overlap %>%
  ggplot(aes(x, y)) +
  geom_point(position = position_jitter(width = 0.01, height = 0)) +
  coord_cartesian(xlim = c(0.95, 1.05))

Lists + style consistency

When a ggplot object is assigned to a variable, it’s saved in memory as a list.

my_plot <- covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  ggtitle("covid-19 cases over time") +
  geom_line() +
  facet_grid(cols = vars(region))

my_plot

This is nice, because it means that we can create a list that can act as our own custom theme when we add it to a new plot:

theme_jae <- list(
  # Start off with the black-and-white theme
  theme_bw(),
  # Make adjustments that I always use in every one of my plots
  theme(
    panel.grid = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )
)

covid_regions %>%
  ggplot(mapping = aes(x=date, y=cases, group=state)) +
  ggtitle("covid-19 cases over time") +
  geom_line() +
  facet_grid(cols = vars(region)) +
  theme_jae

Or, since we’ve already saved that plot in memory:

my_plot +
  theme_jae

Saving plots

Once you’ve created a plot that you’re happy with, you might want to save it to your hard drive. We can do this using ggsave. By default, this function “remembers” the most recent visualization created by ggplot, and saves that visualization. You can also specify the name of a ggplot assigned to a variable, and it’ll save that visualization instead.

Many of the most popular image formats are rasterized. For example, if you have a 500x500-pixel PNG file (or JPG or GIF or BMP…), the plot is saved as a collection of pixels that are located somewhere in that pixel space. If you resize that file so that it’s now 1000x1000 pixels, you’ll end up with a blurry image. Raster formats also utilize some form of compression, which means that there is some loss of information when the plot is saved as an image.

I strongly prefer to save my plots as vectorized images (if you’re familiar with ideas from linear algebra, this should already give you an idea of what this entails). Instead of characterizing the plot within pixel space, the plot is saved as a collection of elements (points, lines, curves, etc) inside vector space. Therefore, by scaling the vectors, you can resize your plot without any loss to image quality. The drawback is that vectorized images can be harder to work with. In my experience, using vectorized PDFs has been a happy medium.

A couple more technical notes for any people interested: DPI stands for “dots per inch” and refers to the resolution of an image if you were to print it. You might remember seeing old-fashioned newspapers where the printed images were clearly composed of many small dots. The more dots per inch, the higher the resolution, leading to better image quality. 300 is considered the minimum for print, and many publishers will insist that you upload 300 DPI images when your article goes to press. A dingbat font is basically one that maps keyboard inputs into icons instead of characters. They were used in the early days of computers when it was easier to render a font instead of displaying an image. I hate them because they make it completely impossible to edit a plot in external software like Adobe Illustrator. Unless you tell ggsave to turn off Dingbat use, I guarantee you that you’ll run into problems with “missing fonts”, visual elements scaling in an unexpected and unpredictable manner, and so on. Granted, the lion’s share of your plot editing should be happening directly in R via ggplot. But, in cases where you need to do a little further editing in external software, you’ll be glad that you didn’t use Dingbats.

ggsave(
  filename = here("Output", "ggsave_example.pdf"),
  plot = my_plot + theme_jae,
  width = 8,
  height = 4,
  units = "in",
  dpi = 300,
  useDingbats = FALSE
  )

The cheatsheet

Feeling overwhelmed yet? That’s completely normal. I use ggplot on a near-daily basis, and I still need to reference the documentation regularly. But now you understand something about the grammar of how a plot is assembled, and it’ll make it much easier for you to look up the right references.

I’ll also note that if you go to the online ggplot reference manual, there’s a section where you can find a cheatsheet for using ggplot.

Exercises

A general note: these are hard exercises, and you won’t necessarily find the answers in this tutorial. Learning to Google (or, if you really care about privacy, DuckDuckGo) answers to your coding questions is itself an essential skill. So, be patient with yourself. Struggling through these exercises will help you understand this material at a deeper level.

  1. You may remember having previously worked with survey data concerning people’s preferences for various Halloween candies. Load in candyhierarchy2017.csv from the data folder, tidy it up, and plot the data. Try splitting up the data by variables like gender, whether respondents went trick-or-treating, and whether respondents perceive The Dress as being black/blue or white/gold. You might find faceting useful here. You get bonus points if you’re able to make the plot spooky and/or Halloween-themed.

  2. The City of New York publishes data on popular baby names, with demographic information about race and gender. Load in Popular_Baby_Names.csv from the data folder and try visualizing the data in different ways. For example, you could track the popularity of a particular name over time. You could plot the ranking of popular names in a particular year. You could investigate what names are only popular within a particular demographic group, versus what names are popular across demographic groups.